Computation and Language 39
☆ Towards Explainable AI Writing Assistants for Non-native English Speakers
We highlight the challenges faced by non-native speakers when using AI
writing assistants to paraphrase text. Through an interview study with 15
non-native English speakers (NNESs) with varying levels of English proficiency,
we observe that they face difficulties in assessing paraphrased texts generated
by AI writing assistants, largely due to the lack of explanations accompanying
the suggested paraphrases. Furthermore, we examine their strategies to assess
AI-generated texts in the absence of such explanations. Drawing on the needs of
NNESs identified in our interviews, we propose four potential user interfaces to
enhance the writing experience of NNESs using AI writing assistants. The
proposed designs focus on incorporating explanations to better support NNESs in
understanding and evaluating the AI-generated paraphrasing suggestions.
comment: CHI In2Writing Workshop 2023 camera-ready version
☆ Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks
Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, David Sontag
Large language models have introduced exciting new opportunities and
challenges in designing and developing new AI-assisted writing support tools.
Recent work has shown that leveraging this new technology can transform writing
in many scenarios such as ideation during creative writing, editing support,
and summarization. However, AI-supported expository writing--including
real-world tasks like scholars writing literature reviews or doctors writing
progress notes--is relatively understudied. In this position paper, we argue
that developing AI supports for expository writing has unique and exciting
research challenges and can lead to high real-world impacts. We characterize
expository writing as evidence-based and knowledge-generating: it contains
summaries of external documents as well as new information or knowledge. It can
be seen as the product of authors' sensemaking process over a set of source
documents, and the interplay between reading, reflection, and writing opens up
new opportunities for designing AI support. We sketch three components for AI
support design and discuss considerations for future research.
comment: 3 pages, 1 figure, accepted by The Second Workshop on Intelligent and
Interactive Writing Assistants
☆ Human-like Summarization Evaluation with ChatGPT
Evaluating text summarization is a challenging problem, and existing
evaluation metrics are far from satisfactory. In this study, we explored
ChatGPT's ability to perform human-like summarization evaluation using four
human evaluation methods on five datasets. We found that ChatGPT was able to
complete annotations relatively smoothly using Likert scale scoring, pairwise
comparison, Pyramid, and binary factuality evaluation. Additionally, it
outperformed commonly used automatic evaluation metrics on some datasets.
Furthermore, we discussed the impact of different prompts, compared its
performance with that of human evaluation, and analyzed the generated
explanations and invalid responses.
comment: 9 pages, 5 figures, in process
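The Likert-scale scoring setup described above can be sketched as a simple prompt template; the wording, dimension name, and scale bounds below are illustrative assumptions, not the authors' exact prompts.

```python
# Hypothetical sketch of a Likert-scale summarization-evaluation prompt.
# The phrasing and "coherence" dimension are made up for illustration.
def build_likert_prompt(source: str, summary: str,
                        dimension: str = "coherence",
                        lo: int = 1, hi: int = 5) -> str:
    """Assemble a prompt asking an LLM to rate one summary on one dimension."""
    return (
        f"Rate the {dimension} of the summary on a scale from {lo} (worst) to "
        f"{hi} (best). Reply with a single integer.\n\n"
        f"Source document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Score:"
    )

prompt = build_likert_prompt("The cat sat on the mat.", "A cat sits on a mat.")
print(prompt.splitlines()[0])
```

The model's single-integer reply would then be parsed and aggregated across samples, mirroring how human Likert annotations are averaged.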
☆ PWESuite: Phonetic Word Embeddings and Tasks They Facilitate
Vilém Zouhar, Kalvin Chang, Chenxuan Cui, Nathaniel Carlson, Nathaniel Robinson, Mrinmaya Sachan, David Mortensen
Word embeddings that map words into a fixed-dimensional vector space are the
backbone of modern NLP. Most word embedding methods encode semantic
information. However, phonetic information, which is important for some tasks,
is often overlooked. In this work, we develop several novel methods which
leverage articulatory features to build phonetically informed word embeddings,
and present a set of phonetic word embeddings to encourage their community
development, evaluation and use. While several methods for learning phonetic
word embeddings already exist, there is a lack of consistency in evaluating
their effectiveness. Thus, we also propose several ways to evaluate both
intrinsic aspects of phonetic word embeddings, such as word retrieval and
correlation with sound similarity, and extrinsic performance, such as rhyme
and cognate detection and sound analogies. We hope that our suite of tasks will
promote reproducibility and provide direction for future research on phonetic
word embeddings.
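The word-retrieval style of intrinsic evaluation mentioned above can be sketched as nearest-neighbor search under cosine similarity; the three-dimensional "articulatory" vectors below are invented for illustration and are far simpler than PWESuite's real feature spaces.

```python
import math

# Toy intrinsic evaluation: given a query word, retrieve the phonetically
# closest other word by cosine similarity over (made-up) feature vectors.
vocab = ["cat", "bat", "dog"]
emb = {
    "cat": (0.9, 0.1, 0.0),
    "bat": (0.8, 0.2, 0.0),  # phonetically close to "cat"
    "dog": (0.0, 0.1, 0.9),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(word: str) -> str:
    # Exclude the query itself, then take the highest-similarity neighbor.
    others = [w for w in vocab if w != word]
    return max(others, key=lambda w: cosine(emb[word], emb[w]))

print(nearest("cat"))  # → bat
```

A retrieval benchmark would score how often the retrieved neighbor is a known phonetic match (e.g., a rhyme or cognate) for each query.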
☆ Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification
Shan Chen, Yingya Li, Sheng Lu, Hoang Van, Hugo JWL Aerts, Guergana K. Savova, Danielle S. Bitterman
Recent advances in large language models (LLMs) have shown impressive ability
in biomedical question-answering, but have not been adequately investigated for
more specific biomedical applications. This study investigates the performance
of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical
tasks beyond question-answering. Because no patient data can be passed to the
OpenAI API public interface, we evaluated model performance with over 10000
samples as proxies for two fundamental tasks in the clinical domain -
classification and reasoning. The first task is classifying whether statements
of clinical and policy recommendations in scientific literature constitute
health advice. The second task is causal relation detection from the biomedical
literature. We compared LLMs with simpler models, such as bag-of-words (BoW)
with logistic regression, and fine-tuned BioBERT models. Despite the excitement
around ChatGPT, we found that fine-tuning for the two fundamental NLP tasks
remained the best strategy. The simple BoW model performed on par with the most
complex LLM prompting. Prompt engineering required significant investment.
comment: 28 pages, 2 tables and 4 figures. Submitting for review
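The bag-of-words baseline referenced above can be sketched as a simple count-vector featurization; the vocabulary here is a made-up illustration, and the logistic-regression classifier the paper places on top is omitted.

```python
import re
from collections import Counter

# Minimal sketch of a bag-of-words featurization: each text becomes a
# count vector over a fixed vocabulary. Tokenization here is a crude
# lowercase word match, purely for illustration.
def bow_vector(text: str, vocab: list) -> list:
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[w] for w in vocab]

vocab = ["should", "may", "advice", "treatment"]
print(bow_vector("Patients should seek advice before treatment.", vocab))
# → [1, 0, 1, 1]
```

Despite its simplicity, the abstract reports that this kind of representation, paired with logistic regression, performed on par with the most complex LLM prompting on these tasks.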
☆ Quantifying the Roles of Visual, Linguistic, and Visual-Linguistic Complexity in Verb Acquisition
Children typically learn the meanings of nouns earlier than the meanings of
verbs. However, it is unclear whether this asymmetry is a result of complexity
in the visual structure of categories in the world to which language refers,
the structure of language itself, or the interplay between the two sources of
information. We quantitatively test these three hypotheses regarding early verb
learning by employing visual and linguistic representations of words sourced
from large-scale pre-trained artificial neural networks. Examining the
structure of both visual and linguistic embedding spaces, we find, first, that
the representation of verbs is generally more variable and less discriminable
within domain than the representation of nouns. Second, we find that if only
one learning instance per category is available, visual and linguistic
representations are less well aligned in the verb system than in the noun
system. However, in parallel with the course of human language development, if
multiple learning instances per category are available, visual and linguistic
representations become almost as well aligned in the verb system as in the noun
system. Third, we compare the relative contributions of factors that may
predict learning difficulty for individual words. A regression analysis reveals
that visual variability is the strongest factor that internally drives verb
learning, followed by visual-linguistic alignment and linguistic variability.
Based on these results, we conclude that verb acquisition is influenced by all
three sources of complexity, but that the variability of visual structure poses
the most significant challenge for verb learning.
☆ ParroT: Translating During Chat Using Large Language Models
Large language models (LLMs) like ChatGPT and GPT-4 have exhibited remarkable
abilities on a wide range of natural language processing (NLP) tasks, including
various machine translation abilities accomplished during chat. However, these
models are only accessible through restricted APIs, which creates barriers to
new research and advancements in the field. Therefore, we propose the
$\mathbf{ParroT}$ framework to enhance and regulate the translation abilities
during chat based on open-source LLMs (i.e., LLaMA-7b) and human-written
translation and evaluation data. Specifically, ParroT reformulates translation
data into the instruction-following style, and introduces a "Hint" field for
incorporating extra requirements to regulate the translation process.
Accordingly, we propose three instruction types for finetuning ParroT models,
including translation instruction, contrastive instruction, and error-guided
instruction. Experiments on two Flores subsets and WMT22 test sets suggest that
translation instruction improves the translation performance of vanilla LLMs
significantly while error-guided instruction can lead to a further improvement,
which demonstrates the importance of learning from low-quality translations
annotated by humans. Meanwhile, the ParroT models can also preserve the ability
on general tasks with the Alpaca multi-task dataset involved in finetuning.
Codes: https://github.com/wxjiao/ParroT
comment: 9 pages; translate during chat
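The instruction-following reformulation with a "Hint" field described above can be sketched as follows; the field names and wording are illustrative assumptions, not the released data format.

```python
# Hypothetical sketch of ParroT-style instruction data with a "Hint" field,
# following the abstract's description.
def to_instruction(src: str, tgt: str, hint=None) -> dict:
    example = {
        "instruction": "Translate the following sentence from English to German.",
        "input": src,
        "output": tgt,
    }
    if hint is not None:
        # The "Hint" carries extra requirements that regulate the translation,
        # e.g. pointing at a preferred or error-annotated reference.
        example["hint"] = hint
    return example

ex = to_instruction("Good morning.", "Guten Morgen.",
                    hint="A preferred translation is: Guten Morgen.")
print(sorted(ex))  # → ['hint', 'input', 'instruction', 'output']
```

Omitting the hint recovers the plain translation instruction, while filling it with contrastive or error-annotated references corresponds to the paper's contrastive and error-guided instruction types.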
☆ Enhancing Multimodal Entity and Relation Extraction with Variational Information Bottleneck
This paper studies the multimodal named entity recognition (MNER) and
multimodal relation extraction (MRE), which are important for multimedia social
platform analysis. The core of MNER and MRE lies in incorporating evident
visual information to enhance textual semantics, where two issues inherently
demand investigations. The first issue is modality-noise, where the
task-irrelevant information in each modality may be noises misleading the task
prediction. The second issue is modality-gap, where representations from
different modalities are inconsistent, preventing the model from building the semantic
alignment between the text and image. To address these issues, we propose a
novel method for MNER and MRE by Multi-Modal representation learning with
Information Bottleneck (MMIB). For the first issue, a refinement-regularizer
probes the information-bottleneck principle to balance the predictive evidence
and noisy information, yielding expressive representations for prediction. For
the second issue, an alignment-regularizer is proposed, where a mutual
information-based item works in a contrastive manner to regularize the
consistent text-image representations. To the best of our knowledge, we are the first
to explore variational IB estimation for MNER and MRE. Experiments show that
MMIB achieves state-of-the-art performance on three public benchmarks.
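For context, the information-bottleneck principle that the refinement-regularizer builds on is conventionally written as the following trade-off; this is the standard textbook form over a representation $Z$ of input $X$ with label $Y$, not necessarily MMIB's exact variational objective:

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

Here $\beta > 0$ balances compression of the (possibly noisy) input against predictive power for the task, which matches the abstract's goal of discarding modality-noise while keeping predictive evidence.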
☆ Personality-aware Human-centric Multimodal Reasoning: A New Task
Multimodal reasoning, an area of artificial intelligence that aims to make
inferences from multimodal signals such as vision, language, and speech, has
drawn more and more attention in recent years. People with different
personalities may respond differently to the same situation. However, such
individual personalities have been ignored in previous studies. In this work, we
introduce a new Personality-aware Human-centric Multimodal Reasoning
(Personality-aware HMR) task, and accordingly construct a new dataset based on
the television show The Big Bang Theory, to predict the behavior of a specific
person at a specific moment, given the multimodal information of their past and
future moments. The Myers-Briggs Type Indicator (MBTI) was annotated and
utilized in the task to represent individuals' personalities. We benchmark the
task with three baseline methods: two adapted from related tasks and one newly
proposed for our task. The experimental results
demonstrate that personality can effectively improve the performance of
human-centric multimodal reasoning. To further solve the lack of personality
annotation in real-life scenes, we introduce an extended task called
Personality-predicted HMR, and propose the corresponding methods, to predict
the MBTI personality at first, and then use the predicted personality to help
multimodal reasoning. The experimental results show that our method can
accurately predict personality and achieves satisfactory multimodal reasoning
performance without relying on personality annotations.
☆ Disentangling Structure and Style: Political Bias Detection in News by Inducing Document Hierarchy
We address an important gap in detection of political bias in news articles.
Previous works that perform supervised document classification can be biased
towards the writing style of each news outlet, leading to overfitting and
limited generalizability. Our approach overcomes this limitation by considering
both the sentence-level semantics and the document-level rhetorical structure,
resulting in a more robust and style-agnostic approach to detecting political
bias in news articles. We introduce a novel multi-head hierarchical attention
model that effectively encodes the structure of long documents through a
diverse ensemble of attention heads. While journalism follows a formalized
rhetorical structure, the writing style may vary by news outlet. We demonstrate
that our method overcomes this domain dependency and outperforms previous
approaches for robustness and accuracy. Further analysis demonstrates the
ability of our model to capture the discourse structures commonly used in the
journalism domain.
comment: Preprint. Under review
☆ Ericson: An Interactive Open-Domain Conversational Search Agent
Open-domain conversational search (ODCS) aims to provide valuable, up-to-date
information, while maintaining natural conversations to help users refine and
ultimately answer information needs. However, creating an effective and robust
ODCS agent is challenging. In this paper, we present a fully functional ODCS
system, Ericson, which includes state-of-the-art question answering and
information retrieval components, as well as intent inference and dialogue
management models for proactive question refinement and recommendations. Our
system was stress-tested in the Amazon Alexa Prize, by engaging in live
conversations with thousands of Alexa users, thus providing an empirical basis for
the analysis of the ODCS system in real settings. Our interaction data analysis
revealed that accurate intent classification, encouraging user engagement, and
careful proactive recommendations contribute most to user satisfaction.
Our study further identifies limitations of the existing search techniques, and
can serve as a building block for the next generation of ODCS agents.
comment: pre-print
☆ Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT
Tong Xie, Yuwei Wa, Wei Huang, Yufei Zhou, Yixuan Liu, Qingyuan Linghu, Shaozhou Wang, Chunyu Kit, Clara Grazian, Bram Hoex
This article presents a new NLP task called structured information inference
(SIS) to address the complexities of information extraction at the device level
in materials science. We accomplished this task by fine-tuning GPT-3 on an
existing perovskite solar cell FAIR dataset, achieving a 91.8 F1-score, and we
updated the dataset with all related scientific papers to date. The produced dataset
is formatted and normalized, enabling its direct utilization as input in
subsequent data analysis. This feature will enable materials scientists to
develop their own models by selecting high-quality review papers within their
domain. Furthermore, we designed experiments to predict PCE and reverse-predict
parameters and obtained comparable performance with DFT, which demonstrates the
potential of large language models to judge materials and design new materials
like a materials scientist.
☆ Document-Level Machine Translation with Large Language Models
Large language models (LLMs) such as ChatGPT can produce coherent, cohesive,
relevant, and fluent answers for various natural language processing (NLP)
tasks. Taking document-level machine translation (MT) as a testbed, this paper
provides an in-depth evaluation of LLMs' ability on discourse modeling. The
study focuses on three aspects: 1) Effects of Discourse-Aware Prompts, where
we investigate the impact of different prompts on document-level translation
quality and discourse phenomena; 2) Comparison of Translation Models, where we
compare the translation performance of ChatGPT with commercial MT systems and
advanced document-level MT methods; 3) Analysis of Discourse Modelling
Abilities, where we further probe discourse knowledge encoded in LLMs and
examine the impact of training techniques on discourse modeling. By evaluating
a number of benchmarks, we surprisingly find that 1) leveraging their powerful
long-text modeling capabilities, ChatGPT outperforms commercial MT systems in
terms of human evaluation. 2) GPT-4 demonstrates a strong ability to explain
discourse knowledge, even though it may select incorrect translation
candidates in contrastive testing. 3) ChatGPT and GPT-4 have demonstrated
superior performance and show potential to become a new and promising paradigm
for document-level translation. This work highlights the challenges and
opportunities of discourse modeling for LLMs, which we hope can inspire the
future design and evaluation of LLMs.
☆ Unleashing the Power of ChatGPT for Translation: An Empirical Study
The recently released ChatGPT has demonstrated surprising abilities in
natural language understanding and natural language generation. Machine
translation is an important and extensively studied task in the field of
natural language processing, which heavily relies on the abilities of language
understanding and generation. Thus, in this paper, we explore how to assist
machine translation with ChatGPT. We adopt several translation prompts on a
wide range of translations. Our experimental results show that ChatGPT with
designed translation prompts can achieve comparable or better performance over
professional translation systems for high-resource language translations but
lags behind significantly on low-resource translations. We further evaluate the
translation quality using multiple references, and ChatGPT achieves superior
performance compared to the professional systems. We also conduct experiments
on domain-specific translations; the results show that ChatGPT is able to
comprehend the provided domain keywords and adjust accordingly to output proper
translations. Finally, we evaluate few-shot prompts, which show consistent
improvement across different base prompts. Our work provides empirical evidence
that ChatGPT still has great potential in translations.
☆ On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection
With advances seen in deep learning, voice-based applications are burgeoning,
ranging from personal assistants, affective computing, to remote disease
diagnostics. As the voice contains both linguistic and paralinguistic
information (e.g., vocal pitch, intonation, speech rate, loudness), there is
growing interest in voice anonymization to preserve speaker privacy and
identity. Voice privacy challenges have emerged over the last few years and
focus has been placed on removing speaker identity while keeping linguistic
content intact. For affective computing and disease monitoring applications,
however, the paralinguistic content may be more critical. Unfortunately, the
effects that anonymization may have on these systems are still largely unknown.
In this paper, we fill this gap and focus on one particular health monitoring
application: speech-based COVID-19 diagnosis. We test two popular anonymization
methods and their impact on five different state-of-the-art COVID-19 diagnostic
systems using three public datasets. We validate the effectiveness of the
anonymization methods, compare their computational complexity, and quantify the
impact across different testing scenarios for both within- and across-dataset
conditions. Lastly, we show the benefits of anonymization as a data
augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss
seen with anonymized data.
comment: 11 pages, 10 figures
♻ ☆ Vision Transformers are Parameter-Efficient Audio-Visual Learners CVPR 2023
Vision transformers (ViTs) have achieved impressive results on various
computer vision tasks in the last several years. In this work, we study the
capability of frozen ViTs, pretrained only on visual data, to generalize to
audio-visual data without finetuning any of its original parameters. To do so,
we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained
ViTs to audio-visual tasks by injecting a small number of trainable parameters
into every layer of a frozen ViT. To efficiently fuse visual and audio cues,
our LAVISH adapter uses a small set of latent tokens, which form an attention
bottleneck, thus, eliminating the quadratic cost of standard cross-attention.
Compared to the existing modality-specific audio-visual methods, our approach
achieves competitive or even better performance on various audio-visual tasks
while using fewer tunable parameters and without relying on costly audio
pretraining or external audio encoders. Our code is available at
https://genjib.github.io/project_page/LAVISH/
comment: CVPR 2023 Project Page: https://genjib.github.io/project_page/LAVISH/
♻ ☆ Vision Learners Meet Web Image-Text Pairs
Most recent self-supervised learning methods are pre-trained on the
well-curated ImageNet-1K dataset. In this work, given the excellent scalability
of web data, we consider self-supervised pre-training on noisy web-sourced
image-text paired data. First, we conduct a benchmark study of representative
self-supervised pre-training methods on large-scale web data in a like-for-like
setting. We compare a range of methods, including single-modal ones that use
masked training objectives and multi-modal ones that use image-text
contrastive training. We observe that existing multi-modal methods do not
outperform their single-modal counterparts on vision transfer learning tasks.
We derive an information-theoretical view to explain these benchmark results,
which provides insight into how to design a novel vision learner. Inspired by
this insight, we present a new visual representation pre-training method,
MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text
data. MUG achieves state-of-the-art transfer performance on a variety of tasks
and demonstrates promising scaling properties. Pre-trained models and code will
be made public upon acceptance.
comment: Project page: https://bzhao.me/MUG/
♻ ☆ A Unified Contrastive Transfer Framework with Propagation Structure for Boosting Low-Resource Rumor Detection
The truth is significantly hampered by massive rumors that spread along with
breaking news or popular topics. Since sufficient corpora can be gathered from
the same domain for model training, existing rumor detection algorithms show
promising performance on yesterday's news. However, due to a lack of training
data and prior expert knowledge, they are poor at spotting rumors concerning
unforeseen events, especially those propagated in different languages (i.e.,
low-resource regimes). In this paper, we propose a unified contrastive transfer
framework to detect rumors by adapting the features learned from well-resourced
rumor data to that of the low-resourced. More specifically, we first represent
rumors circulated on social media as an undirected topology, and then train a
Multi-scale Graph Convolutional Network via a unified contrastive paradigm. Our
model explicitly breaks the barriers of the domain and/or language issues, via
language alignment and a novel domain-adaptive contrastive learning mechanism.
To enhance the representation learning from a small set of target events, we
reveal that the rumor-indicative signal is closely correlated with the uniformity
of the distribution of these events. We design a target-wise contrastive
training mechanism with three data augmentation strategies, capable of unifying
the representations by distinguishing target events. Extensive experiments
conducted on four low-resource datasets collected from real-world microblog
platforms demonstrate that our framework achieves much better performance than
state-of-the-art methods and exhibits a superior capacity for detecting rumors
at early stages.
comment: A significant extension of the first contrastive approach for
low-resource rumor detection (arXiv:2204.08143)
♻ ☆ Identifying Mentions of Pain in Mental Health Records Text: A Natural Language Processing Approach
Pain is a common reason for accessing healthcare resources and is a growing
area of research, especially in its overlap with mental health. Mental health
electronic health records are a good data source to study this overlap.
However, much information on pain is held in the free text of these records,
where mentions of pain present a unique natural language processing problem due
to its ambiguous nature. This project uses data from an anonymised mental
health electronic health records database. The data are used to train a machine
learning based classification algorithm to classify sentences as discussing
patient pain or not. This will facilitate the extraction of relevant pain
information from large databases, and the use of such outputs for further
studies on pain and mental health. 1,985 documents were manually
triple-annotated for creation of gold standard training data, which was used to
train three commonly used classification algorithms. The best performing model
achieved an F1-score of 0.98 (95% CI 0.98-0.99).
comment: 5 pages, 2 tables, submitted to MEDINFO 2023 conference
♻ ☆ Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection
This paper investigates the effectiveness of large language models (LLMs) in
email spam detection by comparing prominent models from three distinct
families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we
examine well-established machine learning techniques for spam detection, such
as Naïve Bayes and LightGBM, as baseline methods. We assess the performance
of these models across four public datasets, utilizing different numbers of
training samples (full training set and few-shot settings). Our findings reveal
that, in the majority of cases, LLMs surpass the performance of the popular
baseline techniques, particularly in few-shot scenarios. This adaptability
renders LLMs uniquely suited to spam detection tasks, where labeled samples are
limited in number and models require frequent updates. Additionally, we
introduce Spam-T5, a Flan-T5 model that has been specifically adapted and
fine-tuned for the purpose of detecting email spam. Our results demonstrate
that Spam-T5 surpasses baseline models and other LLMs in the majority of
scenarios, particularly when there are a limited number of training samples
available. Our code is publicly available at
https://github.com/jpmorganchase/emailspamdetection.
♻ ☆ DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning
Large-scale text-to-image generation models have achieved remarkable progress
in synthesizing high-quality, feature-rich images with high resolution guided
by texts. However, these models often struggle with novel concepts, e.g., new
styles, object entities, etc. Although recent attempts have employed
fine-tuning or prompt-tuning strategies to teach the pre-trained diffusion
model novel concepts from a reference image set, they have the drawback of
overfitting to the given reference images, particularly in one-shot
applications, which harms the generation of diverse, high-quality images
while maintaining generation controllability.
To tackle this challenge, we present a simple yet effective method called
DreamArtist, which employs a positive-negative prompt-tuning learning strategy.
Specifically, DreamArtist incorporates both positive and negative embeddings
and jointly trains them. The positive embedding aggressively captures the
salient characteristics of the reference image to drive diversified generation
and the negative embedding rectifies inadequacies from the positive embedding.
It learns not only what is correct, but also what can be avoided or improved.
We have conducted extensive experiments and evaluated the proposed method from
image similarity and diversity, generation controllability, and style cloning.
Our DreamArtist achieves superior generation performance over
existing methods. In addition, our evaluation on extended tasks,
including concept compositions and prompt-guided image editing, demonstrates
its effectiveness for more applications.
♻ ☆ Debiasing Scores and Prompts of 2D Diffusion for Robust Text-to-3D Generation CVPR 2023
The view inconsistency problem in score-distilling text-to-3D generation,
also known as the Janus problem, arises from the intrinsic bias of 2D diffusion
models, which leads to the unrealistic generation of 3D objects. In this work,
we explore score-distilling text-to-3D generation and identify the main causes
of the Janus problem. Based on these findings, we propose two approaches to
debias the score-distillation frameworks for robust text-to-3D generation. Our
first approach, called score debiasing, involves gradually increasing the
truncation value for the score estimated by 2D diffusion models throughout the
optimization process. Our second approach, called prompt debiasing, identifies
conflicting words between user prompts and view prompts utilizing a language
model and adjusts the discrepancy between view prompts and object-space camera
poses. Our experimental results show that our methods improve realism by
significantly reducing artifacts and achieve a good trade-off between
faithfulness to the 2D diffusion models and 3D consistency with little
overhead.
comment: CVPR 2023 GCV workshop
♻ ☆ AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
Audio editing is applicable for various purposes, such as adding background
sound effects, replacing a musical instrument, and repairing damaged audio.
Recently, some diffusion-based methods achieved zero-shot audio editing by
using a diffusion and denoising process conditioned on the text description of
the output audio. However, these methods still have some problems: 1) they have
not been trained on editing tasks and cannot ensure good editing effects; 2)
they can erroneously modify audio segments that do not require editing; 3) they
need a complete description of the output audio, which is not always available
or necessary in practical scenarios. In this work, we propose AUDIT, an
instruction-guided audio editing model based on latent diffusion models.
Specifically, AUDIT has three main design features: 1) we construct triplet
training data (instruction, input audio, output audio) for different audio
editing tasks and train a diffusion model using instruction and input (to be
edited) audio as conditions to generate output (edited) audio; 2) it can
automatically learn to only modify segments that need to be edited by comparing
the difference between the input and output audio; 3) it only needs edit
instructions instead of full target audio descriptions as text input. AUDIT
achieves state-of-the-art results in both objective and subjective metrics for
several audio editing tasks (e.g., adding, dropping, replacement, inpainting,
super-resolution). Demo samples are available at https://audit-demo.github.io/.
♻ ☆ The optimality of word lengths. Theoretical foundations and an empirical study
Sonia Petrini, Antoni Casas-i-Muñoz, Jordi Cluet-i-Martinell, Mengxue Wang, Christian Bentz, Ramon Ferrer-i-Cancho
Zipf's law of abbreviation, namely the tendency of more frequent words to be
shorter, has been viewed as a manifestation of compression, i.e. the
minimization of the length of forms -- a universal principle of natural
communication. Although the claim that languages are optimized has become
trendy, attempts to measure the degree of optimization of languages have been
rather scarce. Here we present two optimality scores that are dually normalized,
namely, they are normalized with respect to both the minimum and the random
baseline. We analyze the theoretical and statistical pros and cons of these and
other scores. Harnessing the best score, we quantify for the first time the
degree of optimality of word lengths in languages. This indicates that
languages are optimized to 62 or 67 percent on average (depending on the
source) when word lengths are measured in characters, and to 65 percent on
average when word lengths are measured in time. In general, spoken word
durations are more optimized than written word lengths in characters. Our work
paves the way to measure the degree of optimality of the vocalizations or
gestures of other species, and to compare them against written, spoken, or
signed human languages.
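Schematically, a score that is "normalized with respect to both the minimum and the random baseline" can be written as follows; the symbol choices and this exact form are illustrative assumptions, as the paper defines its own scores:

```latex
% Schematic dually normalized optimality score:
% \Omega = 0 at the random baseline, \Omega = 1 at the theoretical minimum.
\Omega \;=\; \frac{\mathbb{E}[L_{\mathrm{rand}}] - \mathbb{E}[L]}
                  {\mathbb{E}[L_{\mathrm{rand}}] - \mathbb{E}[L_{\mathrm{min}}]}
```

Under this reading, the reported averages of 62 to 67 percent would correspond to scores of roughly 0.62 to 0.67 on a scale from random (0) to fully minimized word lengths (1).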
comment: On the one hand, the article has been reduced: analyses of the law of
abbreviation and some of the methods have been moved to another article;
appendix B has been reduced. On the other hand, various parts have been
rewritten for clarity; new figures have been added to ease the understanding
of the scores; new citations added. Many typos have been corrected
♻ ☆ Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora EACL 2023
Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab, Andani Madodonga, Matimba Shingange, Daniel Njini, Vukosi Marivate
This paper introduces two multilingual government themed corpora in various
South African languages. The corpora were collected by gathering the South
African Government newspaper (Vuk'uzenzele), as well as South African
government speeches (ZA-gov-multilingual), that are translated into all 11
South African official languages. The corpora can be used for a myriad of
downstream NLP tasks. The corpora were created to allow researchers to study
the language used in South African government publications, with a focus on
understanding how South African government officials communicate with their
constituents. In this paper we highlight the process of gathering, cleaning and
making available the corpora. We create parallel sentence corpora for Neural
Machine Translation (NMT) tasks using Language-Agnostic Sentence
Representations (LASER) embeddings. With these aligned sentences we then
provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively
multilingual pre-trained language model.
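Embedding-based sentence alignment of the kind LASER enables might look like the following sketch; it uses cosine nearest-neighbour matching, whereas production pipelines typically use margin-based scoring, and the embeddings here are toy vectors:

```python
import numpy as np

def align_sentences(src_embs, tgt_embs, threshold=0.8):
    """Pair each source sentence with its nearest target sentence by cosine
    similarity, keeping pairs above a threshold. A simplified stand-in for
    LASER-based parallel sentence mining."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T                      # cosine similarity matrix
    best = sims.argmax(axis=1)              # nearest target per source
    return [(i, int(j), float(sims[i, j])) for i, j in enumerate(best)
            if sims[i, j] >= threshold]

# Toy embeddings: source 0 should align with target 1, source 1 with target 0.
src = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt = np.array([[0.0, 1.0], [1.0, 0.0]])
pairs = align_sentences(src, tgt, threshold=0.5)
print([(i, j) for i, j, _ in pairs])  # → [(0, 1), (1, 0)]
```

Aligned pairs produced this way form the parallel corpora on which the NMT benchmarks can then be trained.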
comment: Accepted and to appear at Fourth workshop on Resources for African
Indigenous Languages (RAIL) at EACL 2023
♻ ☆ A Simple and Effective Method of Cross-Lingual Plagiarism Detection
We present a simple cross-lingual plagiarism detection method applicable to a
large number of languages. The presented approach leverages open multilingual
thesauri for candidate retrieval task and pre-trained multilingual BERT-based
language models for detailed analysis. The method does not rely on machine
translation and word sense disambiguation when in use, and therefore is
suitable for a large number of languages, including under-resourced languages.
The effectiveness of the proposed approach is demonstrated for several existing
and new benchmarks, achieving state-of-the-art results for French, Russian, and
Armenian languages.
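The thesaurus-based candidate-retrieval stage can be sketched as mapping words in different languages to shared concept IDs and scoring documents by concept overlap; the tiny thesaurus and Jaccard scoring below are illustrative assumptions, not the paper's exact resources:

```python
def concept_overlap(query_tokens, doc_tokens, thesaurus):
    """Candidate retrieval via an open multilingual thesaurus: map tokens in
    each language to shared concept IDs and score documents by Jaccard
    overlap of their concept sets."""
    q = {thesaurus[t] for t in query_tokens if t in thesaurus}
    d = {thesaurus[t] for t in doc_tokens if t in thesaurus}
    return len(q & d) / len(q | d) if q | d else 0.0

# Toy thesaurus linking English and French words to shared concepts.
thesaurus = {"dog": "C1", "chien": "C1", "house": "C2", "maison": "C2",
             "cat": "C3", "chat": "C3"}
score = concept_overlap(["dog", "house"], ["chien", "maison", "chat"], thesaurus)
print(score)  # → 0.6666666666666666
```

Because the overlap is computed in concept space rather than word space, no machine translation or word sense disambiguation is needed at retrieval time; the shortlisted candidates would then go to the multilingual BERT-based model for detailed analysis.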
♻ ☆ Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Most current multi-modal summarization methods follow a cascaded manner,
where an off-the-shelf object detector is first used to extract visual
features, then these features are fused with language representations to
generate the summary with an encoder-decoder model. This cascaded approach cannot
capture the semantic alignments between images and paragraphs, which are
crucial to a precise summary. In this paper, we propose ViL-Sum to jointly
model paragraph-level \textbf{Vi}sion-\textbf{L}anguage Semantic Alignment and
Multi-Modal \textbf{Sum}marization. The core of ViL-Sum is a joint multi-modal
encoder with two well-designed tasks, image reordering and image selection. The
joint multi-modal encoder captures the interactions between modalities, where
the reordering task guides the model to learn paragraph-level semantic
alignment and the selection task guides the model to select summary-related
images for the final summary. Experimental results show that our proposed
ViL-Sum significantly outperforms current state-of-the-art methods. In further
analysis, we find that the two well-designed tasks and the joint multi-modal
encoder can effectively guide the model to learn reasonable paragraph-image and
summary-image relations.
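As a rough illustration of the image-reordering auxiliary task, each shuffled image's original position could be predicted with a softmax over positions and a cross-entropy loss; the shapes and formulation here are assumptions, not ViL-Sum's exact setup:

```python
import numpy as np

def reorder_loss(position_logits, true_order):
    """Toy image-reordering objective: for each shuffled image, the model
    emits logits over original positions; cross-entropy against the true
    ordering encourages paragraph-level semantic alignment."""
    # position_logits: (num_images, num_positions)
    probs = np.exp(position_logits - position_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(true_order)), true_order]))

# Confident, correct position predictions should yield a small loss.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0],
                   [0.0, 0.0, 4.0]])
loss = reorder_loss(logits, np.array([0, 1, 2]))
print(loss < 0.1)  # → True
```

The image-selection task could be formulated analogously as a per-image binary classification of whether the image belongs in the final summary.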
♻ ☆ Difficulty in learning chirality for Transformer fed with SMILES
Recent years have seen the development of descriptor generation based on
representation learning of extremely diverse molecules, especially those that
apply natural language processing (NLP) models to SMILES, a literal
representation of molecular structure. However, little research has been done
on how these models understand chemical structure. To address this, we
investigated the relationship between the learning progress of SMILES and
chemical structure using a representative NLP model, the Transformer. The
results suggest that while the Transformer learns partial structures of
molecules quickly, it requires extended training to understand overall
structures. Consistently, the accuracy of molecular property predictions using
descriptors generated from models at different learning steps was similar from
the beginning to the end of training. Furthermore, we found that the
Transformer requires particularly long training to learn chirality and
sometimes stagnates with low translation accuracy due to misunderstanding of
enantiomers. These findings are expected to deepen understanding of NLP models
in chemistry.
comment: 20 pages, 6 figures
♻ ☆ Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition
Transformer-based models have recently made significant achievements in the
application of end-to-end (E2E) automatic speech recognition (ASR). It is
possible to deploy the E2E ASR system on smart devices with the help of
Transformer-based models. However, these models still have the disadvantage of
requiring a large number of model parameters. To overcome this drawback of
universal Transformer models for ASR on edge devices, we propose a solution
that reuses blocks in Transformer models for small-footprint ASR systems,
meeting the objective of accommodating resource limitations without
compromising recognition accuracy.
Specifically, we design a novel block-reusing strategy for speech Transformer
(BRST) to enhance the effectiveness of parameters and propose an adapter module
(ADM) that can produce a compact and adaptable model with only a few additional
trainable parameters accompanying each reusing block. We conducted an
experiment with the proposed method on the public AISHELL-1 corpus, and the
results show that the proposed approach achieves the character error rate (CER)
of 9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM,
respectively. In addition, we also conduct a deeper analysis to show the
effect of the ADM in the general block-reusing method.
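A numpy-only caricature of block reuse with adapters; the single shared "block" is just a tanh layer here, and the bottleneck adapter shapes are assumptions rather than BRST's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
D, BOTTLENECK, REUSES = 16, 4, 4

# One shared block's weights (reused), plus a small adapter per reuse step.
W_shared = rng.standard_normal((D, D)) * 0.05
adapters = [(rng.standard_normal((D, BOTTLENECK)) * 0.05,
             rng.standard_normal((BOTTLENECK, D)) * 0.05)
            for _ in range(REUSES)]

def forward(x):
    """Apply the same (heavy) block REUSES times; each pass adds a cheap
    bottleneck adapter so the iterations can specialize. A real block would
    include attention, layer norm, and a feed-forward sublayer."""
    for down, up in adapters:
        h = np.tanh(x @ W_shared)      # shared block (parameters reused)
        x = x + h + (h @ down) @ up    # residual + adapter correction
    return x

x = rng.standard_normal((2, D))
y = forward(x)
print(y.shape)  # → (2, 16)
```

Reusing one block keeps a single copy of the heavy weights, while each adapter adds only 2·D·BOTTLENECK extra parameters per pass, which mirrors the small-footprint goal of adding "only a few additional trainable parameters" per reused block.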
♻ ☆ CokeBERT: Contextual Knowledge Selection and Embedding towards Enhanced Pre-Trained Language Models
Several recent efforts have been devoted to enhancing pre-trained language
models (PLMs) by utilizing extra heterogeneous knowledge in knowledge graphs
(KGs) and achieved consistent improvements on various knowledge-driven NLP
tasks. However, most of these knowledge-enhanced PLMs embed static sub-graphs
of KGs ("knowledge context"), regardless of that the knowledge required by PLMs
may change dynamically according to specific text ("textual context"). In this
paper, we propose a novel framework named Coke to dynamically select contextual
knowledge and embed knowledge context according to textual context for PLMs,
which can avoid the effect of redundant and ambiguous knowledge in KGs that
cannot match the input text. Our experimental results show that Coke
outperforms various baselines on typical knowledge-driven NLP tasks, indicating
the effectiveness of utilizing dynamic knowledge context for language
understanding. Besides the performance improvements, the dynamically selected
knowledge in Coke can describe the semantics of text-related knowledge in a
more interpretable form than the conventional PLMs. Our source code and
datasets will be available to provide more details for Coke.
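The dynamic selection idea can be caricatured as scoring candidate KG triples against the textual context and keeping the top-k; Coke's actual selection operates over sub-graphs with an attention-based mechanism, so the dot-product scoring below is purely illustrative:

```python
import numpy as np

def select_knowledge(text_emb, triple_embs, k=2):
    """Score each candidate KG triple against the textual-context embedding
    and keep the top-k, so only text-relevant knowledge is injected into the
    PLM. Returns the selected triple indices in ascending order."""
    scores = triple_embs @ text_emb       # relevance of each triple to the text
    top = np.argsort(scores)[::-1][:k]    # indices of the k highest scores
    return sorted(top.tolist())

text_emb = np.array([1.0, 0.0])
triples = np.array([[0.9, 0.1],    # relevant to the text
                    [-0.8, 0.2],   # contradicts the context
                    [0.7, 0.5],    # relevant to the text
                    [0.0, 1.0]])   # orthogonal / ambiguous
print(select_knowledge(text_emb, triples, k=2))  # → [0, 2]
```

Filtering this way is one concrete reading of how "redundant and ambiguous knowledge in KGs that cannot match the input text" could be kept out of the knowledge context.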
♻ ☆ Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
As large language models (LLMs) gain popularity among speakers of diverse
languages, we believe that it is crucial to benchmark them to better understand
model behaviors, failures, and limitations in languages beyond English. In this
work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national
medical licensing examinations from the past five years, including the current
year. Our team comprises native Japanese-speaking NLP researchers and a
practicing cardiologist based in Japan. Our experiments show that GPT-4
outperforms ChatGPT and GPT-3 and passes all six years of the exams,
highlighting LLMs' potential in a language that is typologically distant from
English. However, our evaluation also exposes critical limitations of the
current LLM APIs. First, LLMs sometimes select prohibited choices that should
be strictly avoided in medical practice in Japan, such as suggesting
euthanasia. Further, our analysis shows that the API costs are generally higher
and the maximum context size is smaller for Japanese because of the way
non-Latin scripts are currently tokenized in the pipeline. We release our
benchmark as Igaku QA as well as all model outputs and exam metadata. We hope
that our results and benchmark will spur progress on more diverse applications
of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
comment: Added results from the March 2023 exam
♻ ☆ Conversion of Legal Agreements into Smart Legal Contracts using NLP WWW '23
A Smart Legal Contract (SLC) is a specialized digital agreement comprising
natural language and computable components. The Accord Project provides an
open-source SLC framework containing three main modules: Cicero, Concerto, and
Ergo. Currently, we need lawyers, programmers, and clients to work together
with great effort to create a usable SLC using the Accord Project. This paper
proposes a pipeline to automate the SLC creation process with several Natural
Language Processing (NLP) models to convert law contracts to the Accord
Project's Concerto model. After evaluating the proposed pipeline, we discovered
that our NER pipeline detects CiceroMark from Accord Project
template text with an accuracy of 0.8. Additionally, our Question Answering
method can extract one-third of the Concerto variables from the template text.
We also delve into some limitations and possible future research for the
proposed pipeline. Finally, we describe a web interface enabling users to build
SLCs. This interface leverages the proposed pipeline to convert text documents
to Smart Legal Contracts by using NLP models.
comment: 7 pages, Companion Proceedings of the ACM Web Conference 2023 (WWW
'23 Companion), April 30-May 4, 2023, Austin, TX, USA
♻ ☆ Sociocultural knowledge is needed for selection of shots in hate speech detection tasks
We introduce HATELEXICON, a lexicon of slurs and targets of hate speech for
the countries of Brazil, Germany, India and Kenya, to aid training and
interpretability of models. We demonstrate how our lexicon can be used to
interpret model predictions, showing that models developed to classify extreme
speech rely heavily on target words when making predictions. Further, we
propose a method to aid shot selection for training in low-resource settings
via HATELEXICON. In few-shot learning, the selection of shots is of paramount
importance to model performance. In our work, we simulate a few-shot setting
for German and Hindi, using HASOC data for training and the Multilingual
HateCheck (MHC) as a benchmark. We show that selecting shots based on our
lexicon leads to models performing better on MHC than models trained on shots
sampled randomly. Thus, when given only a few training examples, using our
lexicon to select shots containing more sociocultural information leads to
better few-shot performance.
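A minimal sketch of lexicon-guided shot selection, assuming shots are ranked by how many lexicon terms they contain; the paper's actual criterion may differ, and the lexicon entries below are placeholders:

```python
def select_shots(candidates, lexicon, k=4):
    """Pick the k training examples containing the most lexicon terms, so the
    selected shots carry more sociocultural information than random shots."""
    def score(text):
        tokens = text.lower().split()
        return sum(tok in lexicon for tok in tokens)
    return sorted(candidates, key=score, reverse=True)[:k]

# Placeholder lexicon entries standing in for real HATELEXICON terms.
lexicon = {"slur1", "slur2", "targetgroup"}
candidates = [
    "a neutral sentence about the weather",
    "slur1 aimed at targetgroup",
    "another ordinary sentence",
    "slur2 used here",
]
shots = select_shots(candidates, lexicon, k=2)
print(shots)  # → ['slur1 aimed at targetgroup', 'slur2 used here']
```

The selected shots would then be used as the few-shot training examples in place of randomly sampled ones.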
♻ ☆ Machine Translation from Signed to Spoken Languages: State of the Art and Challenges
Automatic translation from signed to spoken languages is an interdisciplinary
research domain, lying on the intersection of computer vision, machine
translation and linguistics. Nevertheless, research in this domain is performed
mostly by computer scientists in isolation. As the domain is becoming
increasingly popular - the majority of scientific papers on the topic of sign
language translation have been published in the past three years - we provide
an overview of the state of the art as well as some required background in the
different related disciplines. We give a high-level introduction to sign
language linguistics and machine translation to illustrate the requirements of
automatic sign language translation. We present a systematic literature review
to illustrate the state of the art in the domain and then, harking back to the
requirements, lay out several challenges for future research. We find that
significant advances have been made on the shoulders of spoken language machine
translation research. However, current approaches are often not linguistically
motivated or are not adapted to the different input modality of sign languages.
We explore challenges related to the representation of sign language data, the
collection of datasets, the need for interdisciplinary research and
requirements for moving beyond research, towards applications. Based on our
findings, we advocate for interdisciplinary research and for basing future
research on the linguistic analysis of sign languages. Furthermore, the inclusion
of deaf and hearing end users of sign language translation applications in use
case identification, data collection and evaluation is of the utmost importance
in the creation of useful sign language translation models. We recommend
iterative, human-in-the-loop, design and development of sign language
translation models.
comment: This is the version of the article submitted to peer review to
Universal Access in the Information Society. Please refer to "De Coster, M.,
Shterionov, D., Van Herreweghe, M. et al. Machine translation from signed to
spoken languages: state of the art and challenges. Univ Access Inf Soc
(2023)." for the published and updated version
♻ ☆ InstructABSA: Instruction Learning for Aspect Based Sentiment Analysis
In this paper, we present InstructABSA, Aspect Based Sentiment Analysis
(ABSA) using the instruction learning paradigm for all ABSA subtasks: Aspect
Term Extraction (ATE), Aspect Term Sentiment Classification (ATSC), and Joint
Task modeling. Our method introduces positive, negative, and neutral examples
to each training sample, and instruction tunes the model (Tk-Instruct) for each
ABSA subtask, yielding significant performance improvements. Experimental
results on the Sem Eval 2014, 15, and 16 datasets demonstrate that InstructABSA
outperforms the previous state-of-the-art (SOTA) approaches on all three ABSA
subtasks (ATE, ATSC, and Joint Task) by a significant margin, outperforming 7x
larger models. In particular, InstructABSA surpasses the SOTA on the Rest14 ATE
subtask by 7.31 percentage points, on the Rest15 ATSC subtask, and on the
Lapt14 Joint Task by 8.63 percentage points. Our results also suggest a strong
generalization ability to new domains across all three subtasks.
comment: 4 pages, 2 figures, 5 tables, 5 appendix pages
♻ ☆ To ChatGPT, or not to ChatGPT: That is the question!
ChatGPT has become a global sensation. As ChatGPT and other Large Language
Models (LLMs) emerge, concerns about misusing them in various ways increase,
such as disseminating fake news, plagiarism, manipulating public opinion,
cheating, and fraud. Hence, distinguishing AI-generated from human-generated
text becomes increasingly essential. Researchers have proposed various detection
methodologies, ranging from basic binary classifiers to more complex
deep-learning models. Some detection techniques rely on statistical
characteristics or syntactic patterns, while others incorporate semantic or
contextual information to improve accuracy. The primary objective of this study
is to provide a comprehensive and contemporary assessment of the most recent
techniques in ChatGPT detection. Additionally, we evaluated other AI-generated
text detection tools that do not specifically claim to detect ChatGPT-generated
content to assess their performance in detecting ChatGPT-generated content. For
our evaluation, we have curated a benchmark dataset consisting of prompts from
ChatGPT and humans, including diverse questions from medical, open Q&A, and
finance domains and user-generated responses from popular social networking
platforms. The dataset serves as a reference to assess the performance of
various techniques in detecting ChatGPT-generated content. Our evaluation
results demonstrate that none of the existing methods can effectively detect
ChatGPT-generated content.
♻ ☆ Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding
Despite recent advances of AI, story understanding remains an open and
under-investigated problem. We collect, preprocess, and publicly release a
video-language story dataset, Synopses of Movie Narratives (SyMoN), containing
5,193 video summaries of popular movies and TV series with a total length of
869 hours. SyMoN captures naturalistic storytelling videos made by human
creators and intended for a human audience. As a prototypical and naturalistic
story dataset, SyMoN features high coverage of multimodal story events and
abundant mental-state descriptions. Its use of storytelling techniques causes
cross-domain semantic gaps that provide appropriate challenges to existing
models. We establish benchmarks on video-text retrieval and zero-shot alignment
on movie summary videos, which showcase the importance of in-domain data and
long-term memory in story understanding. With SyMoN, we hope to lay the
groundwork for progress in multimodal story understanding.
comment: 25 pages, 17 figures
♻ ☆ Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition ICASSP 2023
Saumya Y. Sahai, Jing Liu, Thejaswi Muniyappa, Kanthashree M. Sathyendra, Anastasios Alexandridis, Grant P. Strimel, Ross McGowan, Ariya Rastrow, Feng-Ju Chang, Athanasios Mouchtaris, Siegfried Kunzmann
We present dual-attention neural biasing, an architecture designed to boost
Wake Words (WW) recognition and improve inference time latency on speech
recognition tasks. This architecture enables a dynamic switch for its runtime
compute paths by exploiting WW spotting to select which branch of its attention
networks to execute for an input audio frame. With this approach, we
effectively improve WW spotting accuracy while saving runtime compute cost as
defined by floating point operations (FLOPs). Using an in-house de-identified
dataset, we demonstrate that the proposed dual-attention network can reduce the
compute cost by $90\%$ for WW audio frames, with only $1\%$ increase in the
number of parameters. This architecture improves WW F1 score by $16\%$ relative
and improves generic rare word error rate by $3\%$ relative compared to the
baselines.
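The runtime branch switch can be sketched as follows; the branch functions are stand-ins that just count invocations, not real attention networks:

```python
def dual_attention_frame(frame, ww_detected, ww_branch, generic_branch):
    """Dynamic compute path: a lightweight wake-word spotter decides, per
    audio frame, which attention branch to execute, so the wake-word-biased
    branch runs only on the frames that need it."""
    return ww_branch(frame) if ww_detected else generic_branch(frame)

calls = {"ww": 0, "generic": 0}

def ww_branch(frame):          # stand-in for the wake-word-biased branch
    calls["ww"] += 1
    return frame * 2

def generic_branch(frame):     # stand-in for the generic ASR branch
    calls["generic"] += 1
    return frame + 1

frames = [0.1, 0.2, 0.3, 0.4]
flags = [True, False, False, True]   # wake word spotted on frames 0 and 3
outs = [dual_attention_frame(f, d, ww_branch, generic_branch)
        for f, d in zip(frames, flags)]
print(calls)  # → {'ww': 2, 'generic': 2}
```

Skipping the unneeded branch per frame is what lets the reported FLOP savings on wake-word audio frames accumulate without adding many parameters.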
comment: Accepted to Proc. IEEE ICASSP 2023
♻ ☆ The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents EACL 2023
We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation
turns between agents working at Statistics Canada and online users looking for
published data tables. The conversations stem from genuine intents, are held in
English or French, and lead to agents retrieving one of over 5000 complex data
tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of
relevant tables based on an ongoing conversation, and (2) automatic generation
of appropriate agent responses at each turn. We investigate the difficulty of
each task by establishing strong baselines. Our experiments on a temporal data
split reveal that all models struggle to generalize to future conversations, as
we observe a significant drop in performance across both tasks when we move
from the validation to the test set. In addition, we find that response
generation models struggle to decide when to return a table. Considering that
the tasks pose significant challenges to existing models, we encourage the
community to develop models for our task, which can be directly used to help
knowledge workers find relevant tables for live chat users.
comment: Accepted at EACL 2023